Annual  Conference  of  the  Prognostics  and  Health  Management  Society  2014 


Qualitative Event-Based Fault Isolation under Uncertain Observations

Matthew  Daigle1,  Indranil  Roychoudhury2,  and  Anibal  Bregon3 

1 NASA Ames Research Center, Moffett Field, California, 94035, USA
matthew.j.daigle@nasa.gov

2 SGT Inc., NASA Ames Research Center, Moffett Field, California, 94035, USA
indranil.roychoudhury@nasa.gov

3 Department of Computer Science, University of Valladolid, Valladolid, Spain
anibal@infor.uva.es


Abstract 

For many systems, automatic fault diagnosis is critical to ensuring
safe and efficient operation. Fault isolation is performed by analyzing
measured signals from the system, and reasoning over the system behavior
to determine which faults have occurred, based on models of predicted
faulty behavior. For dynamic systems, reasoning may be performed using
qualitative analysis of the differences between measured signals and
their predicted values, in which observations take the form of
qualitative symbols. Such an approach is quick to isolate faults, but
depends critically on correct generation of the qualitative symbols from
the signals. In this paper, we develop an approach to qualitative
event-based fault isolation for dynamic systems that is robust to
incorrect qualitative observations. Observations are treated as
uncertain, where multiple interpretations of an observation, each with
its own probability, are considered. By interpreting observed symbols in
a probabilistic manner, the approach degrades gracefully as the number
of incorrectly generated symbols increases. The approach is demonstrated
on an electrical power system testbed, and experiments using real data
obtained from the hardware demonstrate the improved fault isolation
performance in the presence of incorrect symbol generation.

1.  Introduction 

For many systems, automatic fault diagnosis is critical to ensuring
safe and efficient operation. Within fault diagnosis, the task of fault
isolation is concerned with an analysis of observed behavior in order to
determine which fault has occurred. In many approaches, observations are
transformed into a discrete symbolic (e.g., qualitative) form over which
reasoning can be performed (Puig, Quevedo, Escobet, & Pulido, 2005;
Koscielny & Zakroczymski, 2000). For dynamic systems, these discrete
observations take the form of events (Daigle, Koutsoukos, & Biswas,
2009).

In qualitative fault isolation, residual signals are computed as the
differences of observed behavior and predicted nominal behavior
(Mosterman & Biswas, 1999). Deviations of the residual signals are then
abstracted into symbolic, qualitative representations, called fault
signatures, to facilitate diagnostic reasoning (specifically, +, -, and
0 symbols, representing increase, decrease, and no change from nominal,
respectively). Fault models describe the potential sequences of fault
signatures produced by faults, forming a qualitative event-based fault
isolation approach (Daigle et al., 2009). Such an approach is quick to
isolate faults, but depends critically on correct generation of these
qualitative fault signatures. When the transformation from observed
quantitative signals into observed qualitative fault signatures does not
produce the correct result, the wrong information will be used to
isolate faults, and this incorrect signature generation will, therefore,
lead to incorrect diagnoses.

In this paper, we develop an observation-robust approach to qualitative
event-based fault isolation for dynamic systems as an extension and
generalization of the approach in (Daigle et al., 2009). Here,
observation-robust means that the approach is still successful, to some
degree, when encountering incorrect observations (henceforth, by
observation we mean the version of the quantitative signal transformed
into a qualitative symbol). By considering the qualitative observations
as uncertain, and interpreting them in a probabilistic manner, the
approach degrades gracefully as the number of incorrectly generated
symbols increases. The approach is demonstrated on the Advanced
Diagnostics and Prognostics Testbed (ADAPT) (Poll et al., 2007), an
electrical power system testbed that has served as a benchmark
diagnostic system in the diagnostics community (Poll et al., 2011;
Sweet, Feldman, Narasimhan, Daigle, & Poll, 2013). Using real
experimental data obtained from the ADAPT hardware, we demonstrate the
improved fault isolation performance in the presence of incorrect symbol
generation.

Several previous works have used probabilistic solutions for different
tasks of the fault diagnosis problem. In (Ricks & Mengshoel, 2009) the
authors use Bayesian Networks (BNs) to represent probabilistic
multi-variate models, which are applied to the ADAPT hardware, as we do
in this paper. Other works have also applied BNs or Dynamic BNs (DBNs)
for fault diagnosis, e.g., in (Pernestål, 2009) the author uses DBNs to
improve the diagnosis of automotive vehicles, and in (Alonso-Gonzalez,
Moya, & Biswas, 2011; Roychoudhury, 2009; Roychoudhury, Biswas, &
Koutsoukos, 2010) DBNs are used for fault diagnosis. In all these cases,
the probabilistic solutions are used to model the systems under
conditions of uncertainty and then to perform diagnosis. However, more
sources of uncertainty appear in the fault diagnosis process due to, for
example, improper threshold selections or incorrect symbol generation.
Our approach in this paper uses a model based on physical equations of
the system, and performs fault diagnosis using this model. The
probabilistic methods are then used to reduce the uncertainty in fault
isolation due to incorrectly generated symbols. An approach similar to
our work is presented in (Ying, Kirubarajan, Pattipati, &
Patterson-Hine, 2000), in the sense that a probabilistic solution is
used to perform fault diagnosis in systems with imperfect diagnosis
tests. However, the diagnosis approach and the probabilistic solution
are different from those used in this paper.

The remainder of the paper is organized as follows. Section 2 formulates
the problem for event-based fault isolation. Section 3 reviews the
standard event-based fault isolation approach, and Section 4 extends the
approach to be observation-robust. Section 5 describes implementations
of the standard and robust frameworks based on qualitative fault
isolation, and presents the case study and results. Section 6 concludes
the paper and discusses future work.

2.  Problem  Formulation 

In this section, we define the fault isolation problem that we aim to
solve. We assume an event-based fault isolation framework, where faults
are isolated based on the analysis of a sequence of observable events
produced as a result of the fault occurrence (where, in the nominal
case, no such events are produced). The approach is related to
discrete-event diagnosis (Sampath, Sengupta, Lafortune, Sinnamohideen, &
Teneketzis, 1996) and, more closely, the concept of chronicles (Cordier
& Dousson, 2000). For the purposes of defining the problem and
describing the fault isolation approach, we present a generalized
theoretical framework for event-based fault isolation. In Section 5, we
will describe a specific implementation of this framework for dynamic
systems (Daigle et al., 2009).

First, we have the set of faults, $F$, that may occur in the system.
Faults produce observable events, called fault signatures.

Definition 1 (Fault Signature). A fault signature for a fault $f$,
denoted by $\sigma_f$, is an event that is observed as a consequence of
the occurrence of $f$. The set of fault signatures for $f$ is denoted as
$\Sigma_f$. The set of fault signatures over a set of faults $F$ is
denoted as $\Sigma_F$, i.e., $\Sigma_F = \bigcup_{f \in F} \Sigma_f$.

These events are produced in some temporal order. A fault trace is one
particular fault signature sequence that may be observed.

Definition 2 (Fault Trace). A fault trace for a fault $f$, denoted by
$\lambda_f$, is a sequence of fault signatures from $\Sigma_f$ resulting
from the occurrence of $f$.

Definition 3 (Maximal Fault Trace). A fault trace $\lambda_f$ for a
fault $f$ is maximal if there is no extension $\lambda_f \sigma_f$ that
is also a fault trace for $f$.

The set of all possible maximal fault traces for a fault is called its
fault language.

Definition 4 (Fault Language). The fault language of a fault $f \in F$,
denoted by $L_f$, is the set of all maximal fault traces for $f$. The
union of fault languages for a set of faults $F$ is denoted as $L_F$,
i.e., $L_F = \bigcup_{f \in F} L_f$.

We assume that we have considered all possible faults in $F$, and that
the fault languages are complete.

Assumption 1 (Completeness of $F$). We assume that $F$ is complete,
i.e., there is no other fault $f \notin F$ that can occur.

Assumption 2 (Completeness of $L_f$). We assume that for every fault
$f \in F$, $L_f$ is complete, i.e., there is no other maximal fault
trace $\lambda_f \notin L_f$ that may occur as a result of $f$.

By Assumptions 1 and 2, whenever some fault trace $\lambda$ occurs, it
must have been produced by some fault $f \in F$, and it must belong to
$L_f$ for at least one $f \in F$. These assumptions are quite standard
in model-based diagnosis. In some approaches, e.g., (Hofbaur & Williams,
2002; Narasimhan & Brownston, 2007), an unknown fault is considered,
which is consistent with everything. In our approach, such a fault could
be included by adding a new $f$ where $L_f$ contains all possible
traces.

Algorithm 1 $F^* \leftarrow$ FaultIsolation($F$)
1: $F^* \leftarrow F$
2: $\lambda \leftarrow \emptyset$
3: while $\sigma_i$ observed do
4:   $\lambda \leftarrow \lambda\sigma_i$
5:   $F^* \leftarrow$ FindConsistentFaults($F^*, \lambda$)
6: end while

So, associated with each fault is a set of fault traces, where the
maximal fault traces are collected into a fault language. When a fault
occurs, a specific event sequence will be observed that belongs to the
fault language. In this framework, fault isolation reduces to matching
observed fault traces to predicted fault traces, to determine which
fault has occurred. The fault isolation problem is therefore defined as
follows.

Problem. Given an observed fault trace, $\lambda$, find the most likely
single fault $f$ that produced $\lambda$.

Here, we aim to find the most likely fault, because the observed fault
trace may not always be generated correctly, due to various reasons,
such as improperly tuned quantitative signal thresholds. If this is the
case, we must find the most likely fault that explains the (incorrectly)
observed trace, because the observed trace may not be found in any
$L_f$. The standard fault isolation approach (Section 3) assumes the
observed trace is always correct, whereas the new robust approach
(Section 4) does not make that assumption, in order to handle
incorrectly observed fault traces in a robust fashion.

3.  Event-Based  Fault  Isolation 

In the standard fault isolation approach, we assume that fault traces
are correctly observed.

Assumption 3. All observed fault signatures are correct, i.e., if fault
signature $\sigma$ occurs, it is observed as $\sigma$.

Therefore, given Assumptions 1-3, when a fault occurs and we observe a
fault trace, this trace must belong to the fault language of at least
one fault. The function of the fault isolation algorithm is simply to
find which faults are consistent with the observed fault trace.

The fault isolation algorithm is presented as Algorithm 1. Initially,
the set of isolated faults, $F^*$, is set to the complete set of faults,
$F$. The initial observed fault trace $\lambda$ is the empty event
sequence. While new fault signatures are observed, we update the
observed fault trace, and reduce $F^*$ to the set of faults consistent
with the new trace.

The FindConsistentFaults algorithm, presented as Algorithm 2, eliminates
from $F^*$ faults that are no longer consistent with the trace extended
with $\sigma_i$. A fault $f$ is consistent with an observed trace
$\lambda$ if there is a fault trace $\lambda_f$ in its fault language of
which $\lambda$ is a prefix ($\sqsubseteq$), i.e., the fault can
generate the observed sequence of events so far. If the fault is indeed
consistent, it is retained; otherwise, it is removed from $F^*$.

As new symbols are observed, $F^*$ shrinks. If the system is
diagnosable, i.e., all faults are distinguishable from each other (via
their fault languages), then $F^*$ will reduce to a single fault. A
fault $f_i$ is distinguishable from $f_j$ in this framework if there is
no trace in $L_{f_i}$ that is a prefix of a trace in $L_{f_j}$.

Algorithm 2 $F^* \leftarrow$ FindConsistentFaults($F^*, \lambda$)
1: for all $f \in F^*$ do
2:   if $\neg$ exists $\lambda_f \in L_f$ such that $\lambda \sqsubseteq \lambda_f$ then
3:     $F^* \leftarrow F^* - \{f\}$
4:   end if
5: end for

Example 1. Consider a set of three faults, $F = \{f_1, f_2, f_3\}$,
where $L_{f_1} = \{cab, acb\}$, $L_{f_2} = \{abc, bac\}$, and
$L_{f_3} = \{cb, ca, ab\}$. Say that we observe first the fault
signature $a$. Each of the faults may produce $a$ as the first fault
signature, so $F^* = \{f_1, f_2, f_3\}$. Say we next observe $b$. Now,
$f_1$ cannot produce a trace starting with $ab$, so it is eliminated,
and $F^* = \{f_2, f_3\}$. Say we next observe $c$. Now, $f_3$ cannot
produce a trace beginning with $abc$, and so $f_2$ is isolated as the
fault.

Let us say we observe a trace that does not belong to any fault
language. There are three explanations for this: (i) an unknown fault
has occurred (violation of Assumption 1), (ii) a valid trace is missing
from a fault language (violation of Assumption 2), or (iii) the trace
was observed incorrectly (violation of Assumption 3). For (i) and (ii),
there is nothing that can be done, so we limit ourselves only to
situation (iii). So, what happens when the trace is observed
incorrectly?

Example 2. Consider again the fault set from the previous example. Say
we observe $c$, then we have $F^* = \{f_1, f_3\}$. Say we then observe
$b$, then we have $F^* = \{f_3\}$. Say we then observe $a$, then we have
$F^* = \emptyset$, i.e., all faults were eliminated. One explanation is
that the $a$ fault signature was falsely observed (i.e., a false alarm),
in which case the true fault is $f_3$.
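Algorithms 1 and 2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes each signature is a single character and each fault language is a finite set of strings, and the names `LANGUAGES`, `find_consistent_faults`, and `fault_isolation` are hypothetical. It replays Examples 1 and 2.

```python
from typing import Dict, Iterable, Set

# Fault languages from Example 1 (assumption: one character per signature).
LANGUAGES: Dict[str, Set[str]] = {
    "f1": {"cab", "acb"},
    "f2": {"abc", "bac"},
    "f3": {"cb", "ca", "ab"},
}


def find_consistent_faults(candidates: Set[str], trace: str,
                           languages: Dict[str, Set[str]]) -> Set[str]:
    """Algorithm 2: keep faults whose language has a trace prefixed by `trace`."""
    return {
        fault for fault in candidates
        if any(t.startswith(trace) for t in languages[fault])
    }


def fault_isolation(observed: Iterable[str],
                    languages: Dict[str, Set[str]]) -> Set[str]:
    """Algorithm 1: extend the observed trace and refine the candidate set."""
    candidates = set(languages)   # F* starts as the complete fault set F
    trace = ""                    # empty observed trace
    for signature in observed:
        trace += signature
        candidates = find_consistent_faults(candidates, trace, languages)
    return candidates


print(fault_isolation("abc", LANGUAGES))  # Example 1: {'f2'}
print(fault_isolation("cba", LANGUAGES))  # Example 2: set()
```

The empty result for the trace cba is the failure mode discussed in Example 2: a single falsely observed signature eliminates every candidate.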

The  result  of  an  incorrectly  observed  trace  is  an  incorrect  fault 
isolation  result.  Either  all  candidates  will  be  eliminated,  as  in 
the  example  above,  or  the  wrong  fault  will  be  isolated  (if  the 
observed  trace  belongs  to  a  fault  language  of  a  fault  that  did 
not  occur).  In  practice,  it  is  not  unlikely  that  a  trace  may  be 
incorrectly  observed,  e.g.,  from  noisy  sensor  signals,  overly 
sensitive  fault  detection  thresholds,  etc.  Clearly,  Algorithm  1 
is  not  robust  in  this  case.  A  more  robust  approach  is  necessary 
to  handle  a  violation  of  Assumption  3. 

4.  Robust  Event-Based  Fault  Isolation 

As described in Section 3, Algorithm 1 makes Assumption 3, i.e., there
is only one interpretation of an observed trace, which is what was
observed. In practice, however, traces may be incorrectly observed, and
so we must drop Assumption 3 in order to be robust to this situation,
i.e., to make the approach observation-robust. In more detail, by
observation-robust, we mean that the approach performs optimally when
all observations are correct, and its performance degrades gracefully as
the number of incorrect observations increases. In practical terms, this
means that the true fault is diagnosed to have the highest probability
of being the one that occurred, when all observations are correct.
Further, its assigned probability decreases when incorrect observations
are encountered, where, up to a certain point, it remains the most
probable fault given the observations.

In order to still perform in the face of incorrect observations, we
must differentiate between an observed trace and an interpreted trace.
For a given observed trace, there are several potential interpreted
traces. An observed trace may or may not belong to any $L_f$. Any valid
interpretation of it, however, must be a prefix of some trace in $L_F$.
That is, given an observed trace, we must generate all correct ways to
interpret it, given the set of considered faults. Each interpreted trace
will have its own probability and its own diagnosis. Given the set of
interpreted traces, their probabilities, and their diagnoses, we can
extract a combined diagnosis that provides, for every fault resulting
from an interpreted trace, a probability of its occurrence.

Say that so far we have an interpreted trace of $\lambda$, and a new
symbol $\sigma_i$ is observed. How do we extend $\lambda$ given
$\sigma_i$? We assume there is a known set of signatures,
$\Sigma_{\sigma_i}$, that can be observed as $\sigma_i$. At a minimum,
this set contains $\sigma_i$ itself. So, when $\sigma_i$ is observed, it
could have been any signature in $\Sigma_{\sigma_i}$ that actually
occurred. However, only a subset of these can extend $\lambda$ and be
consistent with a given set of faults. To be consistent, the extended
trace has to be a prefix of some trace found in $L_F$ (since an
interpreted trace must belong to $L_F$).

Example 3. Consider again the set of three faults,
$F = \{f_1, f_2, f_3\}$, where $L_{f_1} = \{cab, acb\}$,
$L_{f_2} = \{abc, bac\}$, and $L_{f_3} = \{cb, ca, ab\}$. Say that
$\Sigma_a = \{a, b\}$, $\Sigma_b = \{b, a\}$, and $\Sigma_c = \{c\}$.
Say that the trace $bca$ is observed; what are the possible interpreted
traces? First $b$ is observed and that can be interpreted as either $a$
or $b$; so far the interpreted traces are $a$ and $b$. Next $c$ is
observed, which can be interpreted only as $c$; so the interpreted
traces are $ac$ and $bc$. Then $a$ is observed, which can be interpreted
as either $a$ or $b$, so the potential interpreted traces are $aca$,
$acb$, $bca$, $bcb$; however, only $acb$ belongs to a fault language and
is valid.
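The enumeration in Example 3 can be reproduced with a short sketch. It is illustrative only: the names `LANGUAGE_UNION`, `CONFUSIONS`, and `interpret` are hypothetical, signatures are assumed to be single characters, and, unlike the narrative of the example, the sketch discards an extension as soon as it stops being a prefix of some trace in the union of the fault languages (so bc is dropped before a is processed).

```python
from typing import Dict, List, Set

# Union of the fault languages of Example 3.
LANGUAGE_UNION: Set[str] = {"cab", "acb", "abc", "bac", "cb", "ca", "ab"}

# CONFUSIONS[s] is the set of signatures that can be observed as s.
CONFUSIONS: Dict[str, Set[str]] = {
    "a": {"a", "b"},
    "b": {"b", "a"},
    "c": {"c"},
}


def is_valid_prefix(trace: str, language: Set[str]) -> bool:
    """True if `trace` is a prefix of at least one trace in `language`."""
    return any(t.startswith(trace) for t in language)


def interpret(observed: str, confusions: Dict[str, Set[str]],
              language: Set[str]) -> List[str]:
    """Enumerate interpreted traces, keeping only valid prefixes at each step."""
    traces = [""]
    for symbol in observed:
        traces = [
            trace + candidate
            for trace in traces
            for candidate in sorted(confusions[symbol])
            if is_valid_prefix(trace + candidate, language)
        ]
    return traces


print(interpret("bca", CONFUSIONS, LANGUAGE_UNION))  # ['acb']
```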

$\Sigma_{\sigma_i}$ may also contain special signatures that represent
false alarms, which we denote using $\epsilon$ with a subscript denoting
the event associated with the false alarm (e.g., $\epsilon_a$ for a
false alarm of event $a$). For example, we could observe some signature
$\sigma$, but it may be possible that no signature occurred and $\sigma$
is to be interpreted as a false alarm. In this case, we require a
special false alarm signature. The fault languages must include traces
that contain false alarm signatures in order for them to be interpreted
from an observed trace. Note that such signatures are not required for
the standard approach due to Assumption 3. We require also a false alarm
"fault" to be included in $F$, whose traces contain only false alarm
signatures. It is not actually a fault but is used to represent the
situation where, so far, only false alarm signatures have been
interpreted from the observed signatures.

Example 4. Consider the same situation as in the previous example,
except with false alarm signatures $\epsilon_a$, $\epsilon_b$, and
$\epsilon_c$. The fault languages are extended by traces where $a$, $b$,
and $c$ can be replaced with these signatures, respectively, e.g.,
$L_{f_1}$, in addition to $cab$, has $\epsilon_c ab$, $c\epsilon_a b$,
and $ca\epsilon_b$, as well as $\epsilon_a cb$, $\epsilon_b ca$,
$\epsilon_a \epsilon_b c$, etc. Here, we have
$\Sigma_a = \{a, b, \epsilon_a\}$, $\Sigma_b = \{b, a, \epsilon_b\}$,
and $\Sigma_c = \{c, \epsilon_c\}$. We require then also the false alarm
fault $E$, which has all traces of the three signatures $\epsilon_a$,
$\epsilon_b$, and $\epsilon_c$. Say again that the trace $bca$ is
observed; what are the possible interpreted traces? First $b$ is
observed and that can be interpreted as either $a$, $b$, or a false
alarm in $b$, $\epsilon_b$. Then $c$ is observed, which is really either
$c$ or $\epsilon_c$, so the potential interpreted traces are $ac$,
$a\epsilon_c$, $b\epsilon_c$, $\epsilon_b c$, and
$\epsilon_b \epsilon_c$ ($bc$ is not included since it does not belong
to any fault language). Next $a$ is observed, which is either $a$, $b$,
or $\epsilon_a$. The interpreted traces are then $acb$, $a\epsilon_c b$,
$b\epsilon_c a$, $b\epsilon_c \epsilon_a$, $\epsilon_b ca$,
$\epsilon_b c\epsilon_a$, $\epsilon_b \epsilon_c a$, and
$\epsilon_b \epsilon_c \epsilon_a$.
The algorithm for robust fault isolation is given as Algorithm 3. We
keep a set of tuples, $\mathcal{L}$, each containing an interpreted
trace $\lambda$, its probability $p$, and its diagnosis $F^*$.
Initially, the set contains only one tuple, which is the empty trace
$\epsilon$, with a probability of 1 and the complete fault set $F$ as
its diagnosis. When a new signature $\sigma_i$ is observed (ln. 2), we
go through each interpreted trace $\lambda$. First, we find all new
signatures that (i) belong to $\Sigma_{\sigma_i}$, and (ii) can extend
$\lambda$ to produce a valid fault trace (ln. 5). For each of these
possible next signatures, we extend the trace with it (ln. 7), assign
the new trace's probability (lns. 8-15), and obtain its diagnosis
(ln. 16). We then add the new tuple $(\lambda', p', F^*)$ to the set of
new tuples $\mathcal{L}'$ (ln. 17), which replaces $\mathcal{L}$
(ln. 20). Finally, we construct the merged diagnosis $\mathcal{F}^*$,
which is a set of tuples of a fault and its probability.

To compute the probability of a trace, we assume that there is a
probability of observing the correct signature, $p_c$. We can compute
the probability of the interpreted signature, $p_\sigma$, as $p_c$ if it
matches the observed signature $\sigma_i$. If it does not match, we
assume that all other signatures are equally probable, so it is assigned
as $(1 - p_c)/(|\Sigma| - 1)$ if $\sigma_i$ is possible to observe, and
$1/|\Sigma|$ if not, where $\Sigma$ is the set of valid next signatures
found in ln. 5 of Algorithm 3. The probability of the trace extended by
$\sigma$ is then the probability of the original trace times the
probability of $\sigma$.
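The three-way rule for the signature probability (lns. 8-14 of Algorithm 3) can be written as a small function. This is a sketch under stated assumptions: `signature_probability` is a hypothetical name, signatures are plain strings, and `valid` stands for the set of valid next signatures computed in ln. 5. The sample sets at the bottom are illustrative and not taken from the paper's figure.

```python
from typing import Set


def signature_probability(sigma: str, observed: str,
                          valid: Set[str], p_c: float) -> float:
    """Probability assigned to interpreting `observed` as `sigma`.

    `valid` is the set of signatures that may validly extend the current
    interpreted trace; `sigma` is assumed to be a member of `valid`.
    """
    if sigma == observed:
        # The observation is taken at face value.
        return p_c
    if observed in valid:
        # The face-value reading was possible: share 1 - p_c among the others.
        return (1.0 - p_c) / (len(valid) - 1)
    # The face-value reading is impossible: all valid readings are equally likely.
    return 1.0 / len(valid)


p_c = 0.9
# Observed b, with a, b, and a false alarm "eb" all valid next signatures.
print(signature_probability("b", "b", {"a", "b", "eb"}, p_c))
print(signature_probability("a", "b", {"a", "b", "eb"}, p_c))
# Observed c, but only the false alarm "ec" can validly extend the trace.
print(signature_probability("ec", "c", {"ec"}, p_c))
```

With $p_c = 0.9$, the first call gives the face-value weight, the second gives half of the remaining 0.1, and the third gives 1 because the only valid reading must carry all of the probability.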

The diagnosis that is merged over all traces is computed as described in
Algorithm 4. Each fault is assigned initially a probability of 0. Then,
for each interpreted trace, the probability of the fault given that
trace, $p(f \mid \lambda)$, is computed as the a priori probability of
the fault divided by the sum of the a priori probabilities of the faults
diagnosed for that trace. This probability, weighted by the trace
probability $p$, is then added to the probability of the fault, $p(f)$.
After going through all traces, each fault is assigned its total
probability. The set $\mathcal{F}^*$ is created by adding tuples for all
faults and their probabilities.




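Algorithm 3 $\mathcal{F}^* \leftarrow$ RobustFaultIsolation($F$)
 1: $\mathcal{L} \leftarrow \{(\epsilon, 1, F)\}$
 2: while $\sigma_i$ observed do
 3:   $\mathcal{L}' \leftarrow \emptyset$
 4:   for all $(\lambda, p, F^*) \in \mathcal{L}$ do
 5:     $\Sigma \leftarrow \{\sigma : \sigma \in \Sigma_{\sigma_i}$ and exists $\lambda_f \in L_{F^*}$ such that $\lambda\sigma \sqsubseteq \lambda_f\}$
 6:     for all $\sigma \in \Sigma$ do
 7:       $\lambda' \leftarrow \lambda\sigma$
 8:       if $\sigma = \sigma_i$ then
 9:         $p_\sigma \leftarrow p_c$
10:       else if $\sigma_i \in \Sigma$ then
11:         $p_\sigma \leftarrow (1 - p_c)/(|\Sigma| - 1)$
12:       else
13:         $p_\sigma \leftarrow 1/|\Sigma|$
14:       end if
15:       $p' \leftarrow p \cdot p_\sigma$
16:       $F^* \leftarrow$ FindConsistentFaults($F^*, \lambda'$)
17:       $\mathcal{L}' \leftarrow \mathcal{L}' \cup \{(\lambda', p', F^*)\}$
18:     end for
19:   end for
20:   $\mathcal{L} \leftarrow \mathcal{L}'$
21:   $\mathcal{L} \leftarrow$ Prune($\mathcal{L}$)
22:   $\mathcal{F}^* \leftarrow$ ConstructF($F, \mathcal{L}$)
23: end while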


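Algorithm 4 $\mathcal{F}^* \leftarrow$ ConstructF($F, \mathcal{L}$)
 1: $\mathcal{F}^* \leftarrow \emptyset$
 2: for all $f \in F$ do
 3:   $p(f) \leftarrow 0$
 4: end for
 5: for all $(\lambda, p, F^*) \in \mathcal{L}$ do
 6:   for all $f \in F^*$ do
 7:     $p(f \mid \lambda) \leftarrow$ (a priori probability of $f$) / (sum of the a priori probabilities of the faults in $F^*$)
 8:     $p(f) \leftarrow p(f) + p \cdot p(f \mid \lambda)$
 9:   end for
10: end for
11: for all $f \in F$ do
12:   $\mathcal{F}^* \leftarrow \mathcal{F}^* \cup \{(f, p(f))\}$
13: end for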


Clearly, the number of interpreted traces, in the worst case, grows
exponentially with each new observed symbol. Each new symbol can be
interpreted in a number of ways and all current interpreted traces need
to be extended with all possible interpretations. In order to control
the computational complexity of the algorithm, a pruning step is added
(ln. 21). Interpreted traces may be removed from $\mathcal{L}$ by, for
example, keeping only the $N$ most probable traces, or keeping only
traces above a probability threshold $p_Q$. After removing traces from
$\mathcal{L}$, the trace probabilities must be normalized.
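A pruning step of this kind might look as follows. The sketch is hypothetical: the function name `prune`, its parameters, and the tuple layout (trace, probability, diagnosis) are assumptions made for illustration, and the sample hypotheses are made-up values. It supports both strategies named above and renormalizes the surviving probabilities.

```python
from typing import FrozenSet, List, Optional, Tuple

# (interpreted trace, probability, diagnosis)
Hypothesis = Tuple[str, float, FrozenSet[str]]


def prune(hypotheses: List[Hypothesis],
          max_traces: Optional[int] = None,
          threshold: Optional[float] = None) -> List[Hypothesis]:
    """Keep the most probable traces and renormalize their probabilities."""
    kept = sorted(hypotheses, key=lambda h: h[1], reverse=True)
    if threshold is not None:
        # Keep only traces strictly above the probability threshold.
        kept = [h for h in kept if h[1] > threshold]
    if max_traces is not None:
        # Keep only the N most probable traces.
        kept = kept[:max_traces]
    total = sum(p for _, p, _ in kept)
    if total == 0.0:
        return []
    return [(trace, p / total, diagnosis) for trace, p, diagnosis in kept]


# Made-up hypotheses for illustration only.
EXAMPLE: List[Hypothesis] = [
    ("t1", 0.6, frozenset({"f1"})),
    ("t2", 0.3, frozenset({"f2"})),
    ("t3", 0.1, frozenset({"f1", "f2"})),
]
for trace, p, diagnosis in prune(EXAMPLE, max_traces=2):
    print(trace, round(p, 4), sorted(diagnosis))
```

Note that an aggressive threshold can remove every trace, in which case this sketch returns an empty list rather than dividing by zero.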

Example 5. Consider again the scenario in the previous example. The
diagnostic tree is shown in Fig. 1. Initially, any of the faults is
possible, including the false alarm fault $E$. The branches in the tree
represent the possible interpreted traces from the observed trace $bca$.
The standard approach would have only one branch. We assume that
$p_c = 0.9$, and the arrows are labeled with the interpreted symbol and
its probability, leading to the new diagnosis and its probability. Since
$bca$ does not belong to any fault language, the standard approach would
fail, whereas in this approach, we have many potential diagnoses that
are ranked probabilistically, depending on the probabilities assigned to
the interpreted symbols. For example, take the leftmost branch, where
$b$ is correctly observed. This happens with 90% probability, and
immediately leads to $\{f_2\}$ as the diagnosis, since no other fault
can produce a $b$ as the first signature. Then $c$ is observed. Since
there is no fault that can produce $bc$, the only valid interpretation,
given that $b$ was correctly observed, is that $c$ was incorrectly
observed and the interpreted signature is $\epsilon_c$, i.e., a false
alarm of symbol $c$. Then $a$ is observed, which can be interpreted only
as $a$ or $\epsilon_a$, but not as $b$ since no fault produces two $b$
signatures in any trace. In either case, the diagnosis remains
$\{f_2\}$. The rightmost branch, on the other hand, represents the case
where all observations were false alarms, and thus the diagnosis is $E$.
For a given fault, its total probability over all interpreted traces can
be computed. If we assume that all faults are equally likely, then
$p(f_2 \mid bca) = 0.81 + 0.09 + 0.005/3 + 0.0045/3 = 0.9032$.
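The sum for $f_2$ in Example 5 can be checked with a sketch of the merging step of Algorithm 4. Several things here are assumptions rather than content of the paper: the function name `construct_diagnosis`, the equal priors, and in particular the four tuples below, whose trace labels and three-fault diagnoses are inferred from the terms of the sum and not read from Fig. 1. Only the interpreted traces that contribute to $f_2$ are listed, so the values computed for the other faults are partial.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple

# (interpreted trace label, trace probability, diagnosis)
Hypothesis = Tuple[str, float, FrozenSet[str]]


def construct_diagnosis(faults: Iterable[str],
                        hypotheses: List[Hypothesis],
                        prior: Dict[str, float]) -> Dict[str, float]:
    """Merge per-trace diagnoses into one probability per fault."""
    merged = {fault: 0.0 for fault in faults}
    for _trace, p, diagnosis in hypotheses:
        # Normalize the priors over the faults diagnosed for this trace.
        norm = sum(prior[fault] for fault in diagnosis)
        for fault in diagnosis:
            merged[fault] += p * prior[fault] / norm
    return merged


FAULTS = ["f1", "f2", "f3"]
PRIOR = {fault: 1.0 / len(FAULTS) for fault in FAULTS}  # assumed equal priors
ALL_THREE = frozenset(FAULTS)

# Assumed tuples: only the four terms that appear in the sum for f2.
HYPOTHESES: List[Hypothesis] = [
    ("b ec a", 0.81, frozenset({"f2"})),
    ("b ec ea", 0.09, frozenset({"f2"})),
    ("a ec b", 0.005, ALL_THREE),
    ("eb ec a", 0.0045, ALL_THREE),
]

merged = construct_diagnosis(FAULTS, HYPOTHESES, PRIOR)
print(round(merged["f2"], 4))  # 0.9032
```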

Clearly, the selection of values for $p_c$ and $p_Q$ will determine the
final computed probabilities of candidates for a given observed trace. A
higher value of $p_c$ will assign a higher probability to the most
consistent candidates and a lower value to the remaining candidates,
i.e., the candidate probability distribution will have a smaller
variance. Similarly, a lower value of $p_c$ will cause the candidate
probability distribution to have a larger variance. If $p_Q$ is too
high, and a trace is incorrectly observed, then it is possible that the
correct candidate can be eliminated. Therefore, both $p_c$ and $p_Q$
have to be selected to best represent the confidence in the symbol
observation process.

5.  Case  Study 

In this section, we describe the application of the new robust
event-based fault isolation framework to ADAPT. We use the qualitative
event-based fault isolation (QFI) framework developed in (Daigle et al.,
2009) and apply the robust methodology to it. We first describe the QFI
framework and how it maps into the general event-based framework
described earlier, then describe the ADAPT system. Finally, we describe
experimental results using data from ADAPT.

5.1.  Qualitative  Event-Based  Fault  Isolation 

In the QFI framework in (Mosterman & Biswas, 1999; Daigle et al., 2009),
signatures capture qualitative deviations in magnitude and slope of
residual signals, where a residual is computed as the difference between
a measured value of a sensor and its expected (model-predicted) value.
So, for a given residual $r$, we can have six different signatures: (i)
an increase in magnitude, (ii) a decrease in magnitude, (iii) an
increase in slope, (iv) a decrease in slope, (v) a false alarm in the
magnitude, and (vi) a false alarm in the slope. For each potential
fault, we can use a dynamic system model to determine which signatures
are possible, as described in (Mosterman & Biswas, 1999).

Figure 1. Example diagnostic tree.

Fault  traces  in  this  framework  obey  a  certain  set  of  con¬ 
straints.  First,  for  a  given  residual  r,  the  magnitude  sym¬ 
bol  must  always  be  observed  before  the  slope  symbol,  and 
magnitude  and  slope  symbols  can  be  observed  only  once  per 
residual  (including  false  alarm  signatures).  Second,  the  order 
of  signatures  between  residuals  must  respect  relative  resid¬ 
ual  orderings  (Daigle,  Koutsoukos,  &  Biswas,  2007),  which 
express  the  intuition  that  faults  manifest  in  some  residuals 
before  others.  Like  signatures,  these  can  be  derived  from  a 
dynamic  system  model  (Daigle,  2008).  Third,  once  a  false 
alarm  signature  occurs  for  the  magnitude,  we  cannot  observe 
any  more  signatures  for  that  residual.  Aside  from  these  re¬ 
strictions,  false  alarms  can  occur  at  any  time.  In  this  frame¬ 
work,  fault  traces  do  not  need  to  be  precomputed  but  can  be 
computed  online  (Daigle  et  al.,  2009). 
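
The three trace constraints above can be sketched as a simple validity check. This is a minimal illustration under stated assumptions, not the implementation from (Daigle et al., 2009): the tuple encoding of signatures, the function name is_valid_trace, and the representation of relative residual orderings as (first, second) pairs are all hypothetical choices made here, and false alarms are assumed to be exempt from the orderings.

```python
from typing import List, Set, Tuple

# (residual name, "mag" or "slope", symbol), with symbol in {"+", "-", "FA"}.
# This encoding is an assumption made for illustration only.
Signature = Tuple[str, str, str]


def is_valid_trace(trace: List[Signature],
                   orderings: Set[Tuple[str, str]]) -> bool:
    """Check a trace against the three constraints described above.

    `orderings` holds hypothetical pairs (first, second): residual `first`
    must show a real (non-false-alarm) magnitude deviation before `second`.
    """
    seen_mag: Set[str] = set()         # residuals with a magnitude symbol
    seen_slope: Set[str] = set()       # residuals with a slope symbol
    deviated: Set[str] = set()         # residuals with a real magnitude deviation
    mag_false_alarm: Set[str] = set()  # residuals whose magnitude was a false alarm

    for residual, kind, symbol in trace:
        # Constraint 3: nothing may follow a magnitude false alarm.
        if residual in mag_false_alarm:
            return False
        if kind == "mag":
            # Constraint 1: at most one magnitude symbol per residual.
            if residual in seen_mag:
                return False
            seen_mag.add(residual)
            if symbol == "FA":
                mag_false_alarm.add(residual)
                continue  # assumed: false alarms are exempt from orderings
            # Constraint 2: respect relative residual orderings.
            for first, second in orderings:
                if second == residual and first not in deviated:
                    return False
            deviated.add(residual)
        elif kind == "slope":
            # Constraint 1: slope only after magnitude, and only once.
            if residual not in seen_mag or residual in seen_slope:
                return False
            seen_slope.add(residual)
        else:
            return False
    return True
```

For instance, with orderings = {("IT267", "E242")}, the trace [("IT240", "mag", "-"), ("IT240", "slope", "-"), ("E242", "mag", "+")] is rejected, because E242 deviates before IT267 has been observed.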

More  information  on  this  framework  and  its  implementation 
may  be  found  in  (Daigle,  Roychoudhury,  &  Bregon,  2013; 
Daigle, Bregon, & Roychoudhury, 2011). For the purposes of
this  paper,  it  suffices  to  say  that  we  build  a  dynamic  model  in 
order  to  compute  residuals,  and  these  are  analyzed  in  a  statis¬ 
tical  manner  to  generate  observed  signatures.  This  involves 
the use of thresholds on the residuals. The major practical
problem here is tuning the thresholds to achieve the desired
trade-off between false alarms and missed detections, which can
be time-consuming. If the thresholds are not perfectly tuned,
signatures can be generated incorrectly. Because perfect tuning
is quite difficult in practice, an approach that is robust to
incorrect signatures is much desired. We compare two different
diagnosers: (i) the QED algorithm, which implements the
FaultIsolation algorithm; and (ii) probabilistic QED (pQED),
which implements the RobustFaultIsolation algorithm. Except
for the fault isolation algorithm, the two diagnosers are the
same.
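
To make the role of the thresholds concrete, the following sketch generates magnitude and slope symbols from a residual sequence. It is a simplified stand-in and not the statistical symbol generation used in the cited works: the windowed mean test, the least-squares slope estimate, and the names magnitude_symbol, slope_symbol, threshold, slope_threshold, and window are assumptions introduced here.

```python
from statistics import mean
from typing import Optional, Sequence


def magnitude_symbol(residual: Sequence[float], threshold: float,
                     window: int = 5) -> Optional[str]:
    """Return '+' or '-' once the windowed mean leaves [-threshold, threshold]."""
    if len(residual) < window:
        return None  # not enough samples yet
    avg = mean(residual[-window:])
    if avg > threshold:
        return "+"
    if avg < -threshold:
        return "-"
    return None  # no deviation detected


def slope_symbol(residual: Sequence[float], slope_threshold: float,
                 window: int = 5) -> str:
    """Classify the least-squares slope (per sample) of the last `window` samples."""
    recent = list(residual[-window:])
    n = len(recent)
    if n < 2:
        return "0"  # a slope cannot be estimated from fewer than two samples
    t_mean = (n - 1) / 2.0
    r_mean = mean(recent)
    num = sum((i - t_mean) * (r - r_mean) for i, r in enumerate(recent))
    den = sum((i - t_mean) ** 2 for i in range(n))
    slope = num / den
    if slope > slope_threshold:
        return "+"
    if slope < -slope_threshold:
        return "-"
    return "0"  # no significant change in slope
```

With the residual [-0.1, -0.2, -0.3, -0.4, -0.5], a magnitude threshold of 0.25 yields a "-" symbol, whereas a threshold of 0.35 yields no detection at all. This illustrates how a poorly tuned threshold can delay or suppress a signature and thereby change the observed ordering of deviations.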

5.2.  ADAPT 

In  this  paper,  we  apply  our  new  methodology  to  the  Advanced 
Diagnostics  and  Prognostics  Testbed  (ADAPT),  an  electrical 
power  distribution  system  that  is  representative  of  those  on 
spacecraft. ADAPT serves as a testbed through which faults
can  be  injected  to  evaluate  diagnostic  algorithms  (Poll  et  al., 
2007).  ADAPT  has  been  established  as  a  diagnostic  bench¬ 
mark  system  through  the  industrial  track  of  the  International 
Diagnostic  Competition  (DXC)  (Kurtoglu  et  al.,  2009;  Poll 
et  al.,  2011;  Sweet  et  al.,  2013).  In  particular,  this  paper  is 
focused  on  diagnosing  faults  on  a  subset  of  ADAPT,  called 
ADAPT-Lite. 

A  system  schematic  for  ADAPT-Lite  is  given  in  Fig.  2.  A 
battery  (BAT2)  supplies  electrical  power  to  several  loads, 
transmitted  through  several  circuit  breakers  (CB236,  CB262, 
CB266,  and  CB280)  and  relays  (EY244,  EY260,  EY281, 
EY272,  and  EY275),  and  an  inverter  (INV2)  that  converts  dc 
to  ac  power.  ADAPT-Lite  has  one  dc  load  (DC485)  and  two 
ac  loads  (AC483  and  FAN416).  There  are  sensors  throughout 
the  system  to  report  electrical  voltage  (names  beginning  with 
“E”),  electrical  current  (“IT”),  and  the  positions  of  relays  and 
circuit  breakers  (“ESH”,  “ISH”).  Finally  there  is  one  sensor 
to  report  the  operating  state  of  a  load  (fan  speed,  “ST”)  and 
another  to  report  the  battery  temperature  (“TE”).  Models  and 
additional  details  for  ADAPT-Lite  can  be  found  in  (Daigle  et 
al., 2011, 2013).

Our list of potential faults includes failures in the relays,
circuit breakers, fan, DC load, and AC load. We also consider
under- and over-speed faults of the fan, and offset, drift, and
intermittent offset faults in the DC and AC loads.

5.3.  Experiments 

Using scenarios available from the DXC, we ran QED and
pQED on a set of 30 nominal scenarios and 71 fault scenarios.
The same fault detectors were used for both algorithms,
so that we can show that, when incorrect signatures are
generated, pQED performs better than QED with the same
information. The fault detector settings are deliberately
nonoptimal in order to better highlight the differences between
the approaches when multiple incorrect observations are
encountered; improving the settings would of course improve the
performance of both algorithms, but would make it harder to
compare their performance in nonoptimal conditions.

We  first  consider  an  example  scenario,  to  illustrate  the  dif¬ 
ferent  diagnosis  approaches.  We  then  summarize  the  perfor¬ 
mance  of  the  approaches  over  all  scenarios. 

As  an  example,  consider  a  resistance  drift  fault  in  AC483. 
The  fault  is  injected  at  60  s  and  detected  at  63  s  with  a  de¬ 
crease  in  IT240.  QED  reduces  the  candidate  list  to  a  failure 
in  AC483,  a  positive  resistance  offset  in  AC483,  a  positive  re¬ 
sistance  drift  in  AC483,  a  failure  in  CB236,  CB262,  CB266, 
EY244,  and  DC485,  a  resistance  increase  in  DC485,  a  resis¬ 
tance  drift  in  DC485,  a  failure  in  EY244,  EY260,  EY272, 
EY275,  EY284,  FAN416,  an  under-speed  fault  in  FAN416, 
and  a  failure  in  INV2.  A  -  signature  for  the  slope  of  the  IT240 
residual  is  then  computed,  for  which  only  the  drift  faults  are 
consistent.  An  increase  in  E242  is  detected  at  120  s,  followed 
by  the  generation  of  a  +  signature  for  its  slope.  QED  elim¬ 
inates  all  faults,  because  it  expects  IT267  to  deviate  before 
E242. On the other hand, pQED retains the drift faults as
candidates, but lowers their probabilities. Before the E242
deviation, the two drift faults had a probability of 38.77% each.
Afterward, the probability drops to 3.92%, and they are still at
the top of the candidate list. With the subsequent signatures for
E242, their probability decreases further, as this is more
evidence of other potential faults, but they remain the most
probable. However, E240 then deviates, again before the expected
deviation in IT267, which reduces their probability further, and
they drop to the eighth and ninth most probable. At this point it
is more likely that the detection of a negative slope (rather
than no change in slope) was incorrect, and so failures in the
circuit breakers and relays become more likely. In this case, no
deviation was detected in IT267. With a more sensitive threshold,
a deviation in IT267 could have been detected first, and the
drift faults would have remained the most probable. Although this
result is not optimal, the true fault was at least contained in
the final diagnosis, albeit not at the highest level of
probability.

5.3.1.  Summary  of  Results 

Over the 71 fault scenarios, both algorithms (since they use
the same fault detectors) correctly detected a fault (true
positives) 69 times, with 2 missed detections (false negatives).
There were no false alarms.

For  the  fault  scenarios,  QED  ends  with  a  list  of  candidates  that 
are  consistent  with  the  observed  symbols.  Ideally,  this  list  is 
a  singleton,  containing  the  true  fault.  If,  given  the  available 
diagnostic  information,  this  is  not  possible,  then  we  desire 
that  it  has  the  true  fault  in  its  final  candidate  list.  In  fact, 
QED  never  obtains  the  true  fault  as  the  single  candidate,  as 
diagnosability  is  not  high  enough  to  achieve  that  condition. 

QED has the correct fault in its candidate list in 24 of 69
scenarios. This means that incorrect signatures are generated
in at least 45 scenarios. This can be improved with better fault
detector tuning; however, we keep these settings in order to
demonstrate the improvement pQED provides. In 32 of these 45
scenarios, QED actually eliminates all faults, as no faults were
consistent with the (incorrect) observations.

For  pQED,  we  used  pc  =  90%,  and  pruned  candidates  with 
probability  less  than  0.1%.  If  pQED  does  not  prune,  then 
it  will  always  have  the  correct  candidate  in  its  candidate  list 
(but  perhaps  with  a  low  probability  assignment).  With  the 
pruning threshold used, pQED has the correct candidate in
its final list 63 of 69 times, which is a significant improvement
over QED. In the 6 cases in which it did not have the true fault,
there were too many incorrect observations, which lowered the
probability of the true fault enough that all traces containing
the fault were pruned.
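
The interplay between probabilistic interpretation and pruning can be illustrated with a deliberately simplified sketch. It is not the RobustFaultIsolation algorithm, which reasons over traces. Here each candidate fault carries a single probability, an observation is assumed correct with probability p_correct (playing the role of the 90% setting above), and candidates falling below prune_below (0.1%) are dropped. The function name update_candidates and both parameter names are hypothetical.

```python
from typing import Dict, Set


def update_candidates(candidates: Dict[str, float],
                      consistent: Set[str],
                      p_correct: float = 0.9,
                      prune_below: float = 0.001) -> Dict[str, float]:
    """Reweight candidates by one observation, normalize, then prune.

    Candidates consistent with the observation are weighted by p_correct;
    the others are kept but weighted by 1 - p_correct, i.e. the observation
    is assumed to have been generated incorrectly.
    """
    weighted = {
        fault: p * (p_correct if fault in consistent else 1.0 - p_correct)
        for fault, p in candidates.items()
    }
    total = sum(weighted.values())
    if total == 0.0:
        return {}
    normalized = {fault: w / total for fault, w in weighted.items()}
    # Pruned mass is not redistributed here; the next update renormalizes.
    return {fault: p for fault, p in normalized.items() if p >= prune_below}
```

Starting from two equally likely candidates A and B, one observation consistent only with A leaves B at 10% rather than eliminating it, as a purely consistency-based diagnoser would. In this toy setting B is pruned only after four consecutive observations that are inconsistent with it, which mirrors how a true fault can be lost when too many incorrect observations accumulate.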

Of course, it is not enough that pQED has the correct fault
in its list, as this depends solely on the pruning threshold.
We are interested in the probability assignment of the true
fault within the final candidate list. pQED diagnoses the true
fault as the fault with highest probability 38 of 69 times. This
is better than the 24 of 69 times for QED. Since QED does
not rank its final candidates, pQED's result is actually
significantly better and more useful. When the true fault is not
ranked the highest, it is in most cases at least contained in the
final candidate list.
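
For reference, the counts reported in this section translate into the following rates. The snippet only restates the numbers given above; the dictionary labels are introduced here for readability.

```python
# (successes, total) pairs taken from the counts reported in the text.
results = {
    "fault detection (both algorithms)": (69, 71),
    "QED: true fault in candidate list": (24, 69),
    "pQED: true fault in candidate list": (63, 69),
    "pQED: true fault ranked first": (38, 69),
}

# Convert each count to a percentage.
rates = {name: 100.0 * hits / total for name, (hits, total) in results.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.1f}%")
# fault detection (both algorithms): 97.2%
# QED: true fault in candidate list: 34.8%
# pQED: true fault in candidate list: 91.3%
# pQED: true fault ranked first: 55.1%
```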

6.  Conclusions 

In this paper, we presented a robust approach to event-based
fault isolation that drops the observation correctness
assumption in order to improve robustness of fault isolation when
events are incorrectly observed. We applied this approach
to a qualitative event-based fault isolation framework.
Experiments using real data from an electrical power system
testbed demonstrated the approach and its improved robustness.

Future  work  will  focus  on  extending  the  approach  to  multiple 
fault  isolation,  and  extending  the  probability  framework  to 
account  for  conditional  probabilities. 